Some quick cleaning

The dataset came from Kaggle but was still a little messy. I'm not sure why the artists column came wrapped in [‘name’]. Some of the date formatting was also inconsistent (as always), so I imputed -01-01 whenever there was no date after the year.
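The two cleaning steps can be sketched roughly like this. The analysis in this post seems to have been done in R, so this is only a Python illustration of the idea and not the code that was actually run. The function names `clean_artists` and `pad_date` are made up for the example, and it assumes the artists values look like a quoted list such as `['name']` and that incomplete dates are bare four-digit years.

```python
import ast
import re


def clean_artists(raw):
    """Turn a string like "['A', 'B']" into "A, B" (hypothetical helper)."""
    try:
        # Assumption: the wrapped value is a valid list literal.
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = None
    if isinstance(parsed, (list, tuple)):
        return ", ".join(str(name) for name in parsed)
    # Fallback: strip the brackets and quotes by hand.
    return raw.strip().strip("[]").strip("'\"")


def pad_date(raw):
    """Append -01-01 when the value is only a year (hypothetical helper)."""
    value = raw.strip()
    if re.fullmatch(r"\d{4}", value):
        return value + "-01-01"
    # Anything that is not a bare year is returned unchanged.
    return value


print(clean_artists("['Queen', 'David Bowie']"))
print(pad_date("1921"))
```

Values that are not a bare year are passed through untouched here, so any other partial formats would need their own rule.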

Looking at the Data Using Lares

Look at the percentages and cumulative percentages: it looks like zero popularity is very common!
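For anyone unsure how to read the percentage and cumulative columns, here is a minimal sketch of how such a frequency table can be built. It is a plain Python illustration, not the plotting code used for this post, and `frequency_table` and the sample values are invented for the example.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent, cumulative percent) rows, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    running = 0
    for value, n in counts.most_common():
        running += n
        rows.append((value, n, round(100 * n / total, 2), round(100 * running / total, 2)))
    return rows


# Made-up popularity scores, only to show the shape of the output.
sample = [0, 0, 0, 5, 5, 12]
for row in frequency_table(sample):
    print(row)
```

The cumulative column always ends at 100, which is a quick sanity check that no values were dropped.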

There are roughly the same number of songs for every year in this dataframe

Basically, newer stuff is more popular, with few to no exceptions

Looks like explicitness is roughly normally distributed with respect to popularity

Now we check out the distribution. There’s some really cool stuff here

There are also some really long songs out there…

This looks more like the actual distribution

Very interesting distribution here

Looks like things are a lot more explicit in 2000-2020, as one might expect. It would be interesting to see when this starts, or what drives it. I also wonder what happened in 1920-1940?

By the way, mode is just whether the song is in a major or minor key.

You can even use ggplot2!

It wouldn’t be data science without some random regressions. It’s even more data science/machine learning-y since the second one is a log-odds table!

variables              corr        pvalue
popularity_log         0.890732    0
acousticness          -0.573162    0
acousticness_log      -0.55757     0
energy_log             0.488822    0
energy                 0.485005    0
loudness               0.457051    0
instrumentalness_log  -0.300402    0
instrumentalness      -0.29675     0
danceability           0.199606    0
danceability_log       0.196287    0
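The table pairs each variable with a `_log` version of itself. As a rough idea of what sits behind one row, here is a sketch of a Pearson correlation computed on a raw column and on its log-transformed copy. It is a Python illustration and not the code behind the table above. It assumes the `_log` columns are something like log(x + 1), which is a guess, and the helper names and numbers are invented. The p-value column is not reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def with_log(column):
    """Assumed transform: log(x + 1). Only valid for values above -1."""
    return [math.log1p(v) for v in column]


# Made-up numbers, only to show the calls.
target = [10, 25, 40, 55, 70]
feature = [0.9, 0.7, 0.5, 0.2, 0.1]

print(pearson(feature, target))
print(pearson(with_log(feature), target))
```

When the raw and log versions give nearly the same correlation, as they do for most rows in the table, the transform is not changing the relationship much.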

Fun with Machine Learning